Complete chloroplast genomes and comparative analysis of Ligustrum species

In this study, we assembled and annotated the chloroplast (cp) genomes of four Ligustrum species: L. sinense, L. obtusifolium, L. vicaryi, and L. ovalifolium 'Aureum'. Together with six previously published Ligustrum cp genomes, we compared gene structure, sequence alignment, codon preference, and nucleotide diversity, screened for positively selected genes, and performed phylogenetic analysis. The cp genomes of Ligustrum were 162,185–166,800 bp in length, with a circular quadripartite structure comprising a large single-copy region (86,885–90,106 bp), a small single-copy region (11,446–11,499 bp), and a pair of inverted repeats, IRa and IRb, with identical sequence content in opposite orientations (31,608–32,624 bp). This structure is similar to the cp genomes of most angiosperms. We found 132–137 genes in the cp genomes of Ligustrum, including 89–90 protein-coding genes, 35–39 tRNAs, and 8 rRNAs. The GC content was 37.93–38.06% and varied among regions, with the IR region having the highest content. Mononucleotide (A/T)n repeats were dominant among the simple-sequence repeats of the Ligustrum cp genomes, showing an obvious A/T preference. Six hotspot regions were identified from multiple sequence alignment of Ligustrum; the ycf1 gene region and the clpP1 exon region can be used as potential DNA barcodes for the identification and phylogeny of the genus Ligustrum. Branch-site model and Bayes empirical Bayes (BEB) analysis showed that four protein-coding genes (accD, clpP, ycf1, and ycf2) were positively selected, and BEB analysis showed that accD and rpl20 had positively selected sites. A phylogenetic tree of Oleaceae species was constructed based on the whole cp genomes, and the results were consistent with traditional taxonomy. The phylogenetic results showed that the genus Ligustrum is most closely related to the genus Syringa.
Our study provides important genetic information to support further investigations of the phylogenetic development and adaptive evolution of Ligustrum species.

There are approximately 50 Ligustrum (Oleaceae) species worldwide, mainly distributed in warm regions of Asia and extending northwest to Europe and south to New Guinea and Australia via Malaysia 1 . Of these, approximately 38 species are distributed in China, mainly in the south and southwest. The genus comprises evergreen, semi-evergreen, or deciduous trees and shrubs with opposite, simple leaves with papery or leathery blades 2 . Ligustrum species are light-loving, slightly shade tolerant, and relatively cold tolerant; their dense, pruning-tolerant branches are used extensively as hedging material of high ornamental value. Ligustrum species also have medicinal value; e.g., Ligustrum lucidum leaves can be distilled to extract wintergreen oil, which is used as an additive in foods and toothpaste. Its dried fruits are also used as the traditional Chinese medicine lucidum, which is cool and bittersweet, and is held to brighten the eyes and hair and nourish the liver and kidneys 3,4 . Ligustrum species also effectively adsorb atmospheric pollutants such as SO2 and NO2 and exhibit strong stress resistance, playing a positive role in purifying the air and improving regional ecological quality 5 . However, research on Ligustrum species has mainly focused on morphology, physiology, population characteristics, and pharmacological activity; few studies have investigated the molecular basis for germplasm identification, genetic breeding, resource conservation, and phylogenetics, a gap that can hinder the conservation and exploitation of Ligustrum species. Therefore, to elucidate the taxonomic relationships and positions of Ligustrum

Results
Chloroplast genome structures of Ligustrum species. The cp genomes of all four Ligustrum species were covalently closed, double-stranded circular molecules comprising a pair of inverted repeats with identical sequence content in opposite orientations (IRa and IRb), one LSC region, and one SSC region. No deletions of large segments or regional bases were detected. Genome length ranged from 162,272 to 166,358 bp (Fig. 1). Heteroplasmy was observed, and each species showed a different set of SNPs when compared with L. sinense. The cp genomes of L. obtusifolium and L. sinense differed in length by 815 bp, with 291 SNPs in total; those of L. vicaryi and L. sinense differed by 3996 bp, with 274 SNPs; and those of L. ovalifolium 'Aureum' and L. sinense differed by 4086 bp, with 284 SNPs (Supplemental file-SNP). Despite this heteroplasmy, the type and number of cp genes differed little among species (Table 1). The cp genomes of the four Ligustrum species are therefore relatively conserved.
Next, the basic characteristics of the cp genomes of ten Ligustrum species were evaluated. The total length of the Ligustrum cp genomes ranged from 162,185 bp (L. vulgare) to 166,800 bp (L. ovalifolium). The length of the LSC region ranged from 86,885 bp (L. sinense) to 90,106 bp (L. ovalifolium); the SSC region from 11,446 bp (L. ovalifolium, L. ovalifolium 'Aureum') to 11,499 bp (L. gracile); the IR region from 31,608 bp (L. vulgare) to 32,624 bp (L. ovalifolium); the coding region from 84,903 bp (L. vicaryi) to 89,070 bp (L. ovalifolium); and the non-coding region from 75,662 bp (L. vulgare) to 81,365 bp (L. vicaryi) (Table 1). A total of 132-137 cp genes were detected, comprising 89-90 protein-coding genes, 35-39 tRNA genes, and 8 rRNA genes. GC content differed among positions within the cp genomes and among genes with different functions. It was generally higher in the coding region (38.00-38.22%) than in the non-coding region (37.70-37.91%), and was highest in the IR region (41.16-41.40%), followed by the LSC region (36.17-36.33%) and the SSC region (32.68-32.81%). Within the coding region, the GC content of rRNA genes was 55.22-55.37%. The total GC content (37.93-38.06%) was lower than that of the IR region but higher than those of the SSC and LSC regions. Among protein-coding sequences, GC content was higher at the first codon position than at the second and third (Fig. 2).
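The regional and codon-position GC percentages reported above are simple base counts. The following is a minimal sketch of how such values could be computed, assuming plain nucleotide strings as input; the function names and the toy sequence are hypothetical and are not taken from the study's pipeline.

```python
from typing import Tuple


def gc_content(seq: str) -> float:
    """Return the GC percentage of a sequence, ignoring non-ACGT symbols."""
    counted = [base for base in seq.upper() if base in "ACGT"]
    if not counted:
        return 0.0
    gc = sum(base in "GC" for base in counted)
    return 100.0 * gc / len(counted)


def gc_by_codon_position(cds: str) -> Tuple[float, float, float]:
    """Return GC percentages at codon positions 1, 2 and 3 of an in-frame CDS."""
    usable = len(cds) - len(cds) % 3  # drop any trailing partial codon
    cds = cds[:usable]
    return tuple(gc_content(cds[i::3]) for i in range(3))


# Toy in-frame sequence (hypothetical, for illustration only).
toy_cds = "ATGGCTGCATAA"
print(gc_content(toy_cds))
print(gc_by_codon_position(toy_cds))
```

Applying `gc_content` separately to the LSC, SSC, and IR subsequences of a genome would give per-region values comparable in kind to those in Table 1.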
Duplicate genes were counted only once; thus, a total of 114 genes were annotated in the cp genomes of ten Ligustrum species, including 82 protein-coding genes, 4 rRNA genes, and 28 tRNA genes (Table 2). Introns play an important role in the regulation of gene expression. A total of 22 genes in the cp genomes of the ten Ligustrum species contained introns: ndhA, ndhB, petB, petD, atpF, rpl2, rpl16, rps12, rps16, rpoC1, accD, trnA-UGC, trnG-GCC, trnG-UCC, trnI-GAU, trnL-CAA, trnL-UAA, trnL-UAG, trnV-GAC, and trnV-UAC each contained one intron, and ycf3 and clpP each contained two introns. The accD gene contained one intron only in L. obtusifolium and L. vicaryi and had no introns in the other Ligustrum species; similarly, the trnV gene contained one intron in L. sinense, L. obtusifolium, L. vicaryi, and L. ovalifolium 'Aureum' and had no introns in the other Ligustrum species. These differences indicate that intron loss has occurred during the evolution of Ligustrum species (Supplementary Table 1).

IR contraction and expansion. The cp genome is a circular structure consisting of the LSC, SSC, IRa, and IRb regions, with four boundaries: LSC-IRb, IRb-SSC, SSC-IRa, and IRa-LSC. Expansion and contraction of the IR region is an important event in plant evolutionary history and causes changes in the size and gene content of the cp genome. In this study, we compared the LSC/IRb/SSC/IRa boundaries of cp genomes from ten Ligustrum species (Fig. 4). The genes at the IR-LSC and IR-SSC boundaries were essentially the same, with relatively conserved IR lengths among all ten species (31,608-32,624 bp) and no significant expansion or contraction events. The exact positions of the IR-SC boundaries nevertheless differed among the ten cp genomes; seven genes (rps19, rpl2, ndhH, ndhF, ndhA, rpl22, and trnH) were present at the LSC-IR and SSC-IR boundaries. The LSC-IRb boundary of L. lucidum was located between trnH and rpl2, with trnH located 14 bp to the left and rpl2 located 59 bp to the right. In all other species, the LSC-IRb boundary was located between rps19 and rpl2 and extended into rps19 with a 1-2 bp length variation, except in L. vulgare, where it was immediately adjacent to rps19. In L. obtusifolium, L. sinense, and L. vicaryi, ndhH was 1 bp to the left of the IRb-SSC boundary; in the other species, the IRb-SSC boundary extended into ndhH, with a length variation of 22-98 bp. The IRb-SSC boundary extended into ndhF by 26 bp in L. ovalifolium 'Aureum' and L. ovalifolium, was immediately adjacent to ndhF in L. obtusifolium and L. quihoui, and was located 4-10 bp to the right of ndhF in the other Ligustrum species. The SSC-IRa boundary of all Ligustrum species extended into ndhH, with a length variation of 74-83 bp; the ndhA gene was located 56-84 bp to the left of this boundary. The IRa-LSC boundary of L. lucidum was between rpl2 and trnH, with rpl2 located at a distance of 59 bp; rpl22 was located 500 bp to the right of the IRa-LSC boundary. In the other Ligustrum species, the IRa-LSC boundary was also between rpl2 and trnH; rpl2 was located 58-63 bp to the left of the boundary and trnH 13-15 bp to the right of it.

Table 2. List of genes annotated in the chloroplast genomes of ten Ligustrum species in this study. *Gene contains one intron; **Gene contains two introns; (× 2) indicates the number of the repeat unit is 2.

Repeat sequence analysis and simple sequence repeats (SSRs). Because SSRs have high polymorphism rates at the species level, they have become an important source of molecular markers and have been extensively investigated in phylogenetic and population genetics studies. In this study, SSRs were mainly distributed in the LSC and SSC regions of the cp genome (Fig. 5A), which also account for most of the cp genome, with few SSRs in the two IR regions. According to SSR location analysis, most were distributed in the non-coding regions of the genome, i.e., the intergenic and intronic regions (Fig. 5B). A total of 164 (L. gracile, L. lucidum, L. japonicum, and L. vulgare) to 170 (L. obtusifolium) SSRs were detected in the cp genomes of Ligustrum species. Mononucleotide repeats were by far the most numerous (140-155); dinucleotide (3-6), trinucleotide (5-13), tetranucleotide (2-4), pentanucleotide (1-3), and hexanucleotide (1-4) repeats were much less common (Fig. 5C). Mononucleotide repeats may therefore play a more important role in gene variation than other types of SSRs. The SSRs were dominated by mononucleotide (A/T)n repeats (Fig. 5G), suggesting that the base composition of SSRs is biased toward A/T bases. Long repetitive sequences (≥ 30 bp) may promote cp genome rearrangement and increase species genetic diversity. A total of 223 (L. sinense) to 1,062 (L. ovalifolium) long repeat sequences were predicted in the Ligustrum cp genomes, including 142-862 forward repeats, 1-8 reverse repeats, 1-8 complementary repeats, and 40-194 palindromic repeats (Fig. 5D). Long repeats of 30-34 bp were the most numerous, and those of 65-69 bp the least numerous (Fig. 5E). Among these, L. ovalifolium 'Aureum' had the highest number of long repeat sequences (Fig. 5F). We also detected 44 (L. vulgare) to 88 (L. ovalifolium) tandem repeats.
Comparative genomic divergence and hotspot regions. To determine the sequence differences among the ten Ligustrum cp genomes, we compared them using the mVISTA software with L. sinense as the reference genome. The types, numbers, and arrangement of genes encoded in the Ligustrum cp genomes were highly consistent among species. Variation among sequences occurred mainly in non-coding intergenic regions, and coding regions were generally more conserved (Fig. 6).
Next, we calculated the nucleotide diversity (Pi) of the ten Ligustrum species. The high-variation regions of the Ligustrum cp genomes were mainly concentrated in the LSC and IR regions. Six regions, i.e., one intergenic region (rbcL_accD) and five gene regions (accD, clpP1-exon3, clpP1-exon2, ycf1, and ycf1), were considered hotspot regions (Pi > 0.06), among which the accD gene region had the highest nucleotide diversity (0.2552083), followed by the intergenic region rbcL_accD (0.172619) (Fig. 7, Table 3). Four of these hotspot regions were located in the LSC region and two in the IR region. Further analysis of the six hotspot regions showed that the rbcL_accD intergenic region included a large number of insertion and deletion events, and that the accD gene showed large fragment deletions and intron loss, resulting in large sequence differences and difficult sequence alignment. These two regions are therefore not recommended as candidate DNA barcodes for Ligustrum. In contrast, the ycf1 gene region and the clpP1 exon region not only have high sequence variability but are also coding sequences, whose alignment can be accurately corrected using triplet codons. Therefore, the ycf1 gene region and the clpP1 exon region can be used as potential DNA barcodes for the identification and phylogeny of Ligustrum.
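Nucleotide diversity (Pi) is the average number of pairwise differences per site among aligned sequences. The sketch below illustrates the calculation and a sliding-window scan of the kind used to flag hotspots; it assumes equal-length aligned sequences and skips alignment columns containing gaps or ambiguous bases. The window and step sizes, function names, and toy alignment are hypothetical, since the study's settings are not given in this section.

```python
from itertools import combinations
from typing import List, Tuple


def nucleotide_diversity(aligned: List[str]) -> float:
    """Average pairwise differences per site, skipping non-ACGT columns."""
    seqs = [s.upper() for s in aligned]
    if len(seqs) < 2:
        raise ValueError("at least two sequences are required")
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must be aligned to equal length")
    # Keep only columns where every sequence has an unambiguous base.
    sites = [i for i in range(length) if all(s[i] in "ACGT" for s in seqs)]
    if not sites:
        return 0.0
    pairs = list(combinations(seqs, 2))
    diffs = sum(a[i] != b[i] for a, b in pairs for i in sites)
    return diffs / (len(pairs) * len(sites))


def sliding_pi(aligned: List[str], window: int, step: int) -> List[Tuple[int, float]]:
    """Return (window midpoint, Pi) for each window along the alignment."""
    length = len(aligned[0])
    results = []
    for start in range(0, length - window + 1, step):
        chunk = [s[start:start + window] for s in aligned]
        results.append((start + window // 2, nucleotide_diversity(chunk)))
    return results


# Toy alignment (hypothetical); real input would be the ten aligned cp genomes.
toy = ["ACGTACGT", "ACGTACGA", "ACCTACGA"]
print(nucleotide_diversity(toy))
print(sliding_pi(toy, window=4, step=4))
```

Windows whose Pi exceeds a chosen cutoff (Pi > 0.06 in this study) would then be reported as candidate hotspot regions.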

Pairwise comparison of species Ka/Ks ratios and positive selection analyses. The Ka/Ks ratios of Ligustrum species were calculated to provide information on the selection pressure acting on individual sequences. Of the ten Ligustrum species, L. lucidum, L. gracile, and L. quihoui had higher Ka/Ks ratios (Fig. 8).
Positive selection analyses of 78 single-copy protein-coding genes from the ten Ligustrum species revealed four genes (accD, clpP, ycf1, and ycf2) subject to significant positive selection (P < 0.05). Bayes empirical Bayes (BEB) analysis revealed significant posterior probabilities for the accD and rpl20 genes, with 49 positively selected sites in accD and four in rpl20 (Supplementary Table 2).
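Significance in branch-site tests is commonly assessed with a likelihood ratio test (LRT) that compares a null model with an alternative model allowing positive selection. The sketch below shows that calculation under the assumption of a chi-square distribution with one degree of freedom; this convention, the function name, and the log-likelihood values are illustrative assumptions and are not stated in this section of the study.

```python
import math
from typing import Tuple


def lrt_pvalue(lnl_null: float, lnl_alt: float) -> Tuple[float, float]:
    """Return (LRT statistic, p-value) assuming chi-square with 1 d.f.

    lnl_null and lnl_alt are the maximised log-likelihoods of the null and
    alternative models (hypothetical inputs for this illustration).
    """
    stat = 2.0 * (lnl_alt - lnl_null)
    if stat <= 0.0:
        # The alternative model fits no better than the null model.
        return stat, 1.0
    # Survival function of chi-square with 1 d.f.: P(X > x) = erfc(sqrt(x / 2)).
    return stat, math.erfc(math.sqrt(stat / 2.0))


# Hypothetical log-likelihoods for one gene.
stat, p = lrt_pvalue(lnl_null=-1000.0, lnl_alt=-996.5)
print(stat, p, p < 0.05)
```

A gene would be reported as positively selected at the P < 0.05 level when the returned p-value falls below 0.05, after which BEB posterior probabilities identify individual sites.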
Phylogenetic results. We applied a maximum likelihood (ML) model to construct a phylogenetic tree of 37 species belonging to 13 genera in Oleaceae. The relationships among the genera in this family were well resolved, and the 13 genera clustered into one branch with high support for each node, which was consistent with the botanical classification (Fig. 9). Ligustrum species clustered into a single monophyletic clade with high support. The European species L. vulgare was the first to diverge. Ligustrum vicaryi, L. ovalifolium 'Aureum', and L. ovalifolium formed one branch, and L. obtusifolium formed another. Ligustrum sinense and L. quihoui clustered together, and Ligustrum was more closely related to Syringa than to the other genera in Oleaceae.

Excluding duplicate genes, a total of 114 genes were annotated in the cp genomes of the ten Ligustrum species, comprising 82 protein-coding genes, 4 rRNA genes, and 28 tRNA genes. Among these, the accD gene contained one intron in L. obtusifolium and L. vicaryi but no introns in the other Ligustrum species; similarly, the trnV gene contained one intron in L. sinense, L. obtusifolium, L. vicaryi, and L. ovalifolium 'Aureum' but no introns in the other Ligustrum species. We assume that the loss of introns in Ligustrum species occurred during the evolutionary process. To some extent, intron loss reflects the rate of species evolution, with faster-evolving species retaining fewer ancestral introns 19,20 . Thus, intron loss and intron polymorphism of the same gene can be used to trace plant evolution.
Candidate DNA barcoding of genus Ligustrum. DNA barcoding technology has a wide range of applications in the fields of species identification, resource conservation, phylogeny, and evolution 21,22 . The cp genomes of Ligustrum species are generally consistent in overall gene content and arrangement, and comparative genome analysis using mVISTA likewise revealed relatively conserved sequences among Ligustrum species. Sequence divergence was slower in the IR regions than in the LSC and SSC regions; this relative conservation is attributed to copy correction through frequent gene conversion between the two IR regions 23 . Single-copy regions have higher nucleotide diversity than IR regions, and non-coding regions have higher nucleotide diversity than coding regions, which is consistent with results from other taxa 24 . Nucleotide diversity analysis identified six highly variable regions, which were mainly located in non-coding regions. The highly variable accD gene sequence identified in this study was also previously identified as the most highly variable hotspot region in Quercus 25 , and the ycf1 gene was also reported as a highly variable hotspot region in Papaveraceae 26 ; therefore, the highly variable hotspot regions identified in this study have potential as candidate markers or DNA barcodes for inferring the phylogeny of Ligustrum. Further analysis of the six hotspot regions showed that the ycf1 gene region and the clpP1 exon region not only have high sequence variability but are also coding sequences, whose alignment can be accurately corrected using triplet codons. We therefore recommend the ycf1 gene region and the clpP1 exon region as potential DNA barcodes for the identification and phylogeny of Ligustrum. Jin et al. also found that ycf1a and ycf1b were two specific DNA barcodes of Ligustrum 17 . In this study, we found that, besides the ycf1 gene, the clpP1 exon region can also be used as a candidate DNA barcode for Ligustrum.
Phylogenetic tree. Using the cp genome data obtained in this study and those published for four additional species, we constructed a phylogenetic tree based on the whole cp genomes for 13 genera and 37 species of Oleaceae. Species of Ligustrum and Syringa have highly similar morphology, which can affect the discovery and identification of their fossils 27 . It is therefore important to study the relationship between Ligustrum and Syringa and their taxonomic status. Analyses of the internal and external transcribed spacers of the rDNA of Syringa 28 and of rps1 and trnL-F sequences suggest that Ligustrum may have originated from Syringa, such that Syringa is a syntaxon 29 . A phylogenetic analysis based on cp genomes also showed that Ligustrum is a monophyletic group, whereas Syringa is a paraphyletic group, and that Ligustrum shows the characteristics of a suspected subclass of Syringa 30 . In this study, an ML phylogenetic tree was constructed using the whole cp genomes.
The phylogenetic results showed that Ligustrum and Syringa clustered together and were closely related. This result supports the view that Ligustrum is a monophyletic group, that Syringa is a syntaxic group, and that Ligustrum may have originated from Syringa. However, compared with the more than 50 species of Ligustrum 1 and nearly 30 species of Syringa 31 , relatively few species have been subjected to complete cp genome sequencing. Therefore, the relationship between Ligustrum and Syringa and their taxonomic status require redefinition and further investigation using more genomic data.

Genome assembly and annotation. Total DNA was extracted from leaves using a plant DNA extraction kit, and the quality, integrity, and concentration of the DNA were determined by agarose gel electrophoresis and spectrophotometry. To obtain high-quality clean reads, the raw sequencing reads were quality-controlled using the Trimmomatic v0.39 software 33 to remove low-quality sequences and junctions. Chloroplast genome assembly was performed using the NOVOPlasty v4.3 software (https://github.com/ndierckx/NOVOPlasty) 34 . Sequences with sufficiently high coverage depth and long assembly length were selected as candidate sequences, and cp scaffolds were confirmed by comparison with the NT library and overlapped sequences. The assembly results were validated by mapping reads to the assembled sequences; the results are shown in the Supplemental Figure, and the specific depth results are provided in the Supplemental file-Depths. BLAST searching 35 was performed to compare the assembled sequences with the cp reference genome of a closely related species (L. quihoui, NC_057246.1) to determine the starting position and orientation of the cp assembly and the possible cp partitioning structure (LSC/IR/SSC), yielding the final cp genome sequence.
The GeSeq software 32 was used to predict protein-coding, tRNA, and rRNA genes in the cp genomes; redundant initial predictions were then removed, and the first and last genes and the exon/intron boundaries were manually corrected to obtain a highly accurate gene set. Finally, the Chloroplot software (https://irscope.shinyapps.io/Chloroplot/) 36

SSRs and repeat sequence analysis. SSRs in the cp genomes of ten Ligustrum species were analyzed using the MISA software 38 , with the parameters 1-8, 2-5, 3-4, 4-3, 5-3, and 6-3, i.e., no fewer than eight repeats for mononucleotide SSRs, five for dinucleotide SSRs, four for trinucleotide SSRs, and three for tetranucleotide, pentanucleotide, and hexanucleotide SSRs. The REPuter software 39

48 , where positively selected genes were evaluated at a level of P < 0.05. Finally, the BEB method was used to calculate the posterior probabilities of amino acid sites to determine whether the sites were positively selected.
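The SSR thresholds given for MISA above (1-8, 2-5, 3-4, 4-3, 5-3, 6-3) pair each repeat-unit length with a minimum repeat count. The following is a minimal regular-expression sketch of how such thresholds define perfect SSRs; it is an illustration only, not MISA's own implementation, and the function names and toy sequence are hypothetical.

```python
import re
from typing import Dict, List, Tuple

# Unit length -> minimum number of repeats, as listed in the text.
MIN_REPEATS: Dict[int, int] = {1: 8, 2: 5, 3: 4, 4: 3, 5: 3, 6: 3}


def is_redundant(motif: str) -> bool:
    """True if the motif is itself a repetition of a shorter unit (e.g. 'ATAT')."""
    k = len(motif)
    return any(k % d == 0 and motif == motif[:d] * (k // d) for d in range(1, k))


def find_ssrs(seq: str, min_repeats: Dict[int, int] = MIN_REPEATS) -> List[Tuple[int, str, int]]:
    """Return sorted (start, motif, repeat count) tuples for perfect SSRs."""
    seq = seq.upper()
    hits = []
    for unit_len, min_rep in sorted(min_repeats.items()):
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_rep - 1))
        pos = 0
        while True:
            match = pattern.search(seq, pos)
            if match is None:
                break
            motif = match.group(1)
            if is_redundant(motif):
                # e.g. 'AA' repeats are already reported as mononucleotide SSRs.
                pos = match.start() + 1
                continue
            hits.append((match.start(), motif, len(match.group(0)) // unit_len))
            pos = match.end()
    return sorted(hits)


# Toy sequence (hypothetical): (A)10 and (AT)5 pass; (AGC)3 is below threshold.
toy = "GG" + "A" * 10 + "CC" + "AT" * 5 + "GG" + "AGC" * 3 + "TT"
print(find_ssrs(toy))
```

Counting the returned hits by motif length would give a per-type tally of the kind summarized in Fig. 5C.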

Materials and methods
Phylogenomic analysis. Complete cp genome sequences of Oleaceae, particularly Ligustrum, were selected from the NCBI database, and phylogenetic analysis was performed using the ten Ligustrum species examined in this study and 27 other Oleaceae species. S. henryi (NC_036943.1) and L. galeobdolon (NC_036972.1) were selected as outgroups. The complete cp genome sequences were used for tree construction. They were extracted and aligned using MAFFT v7.458 49 , and the alignment was trimmed with Gblocks_0.91b 50 to remove low-quality regions, using the parameters -t=d -b4=5 -b5=h. The best-fit model, GTR + I + G, was selected using jModelTest 2.1.10 52 according to the Bayesian information criterion (BIC). An ML phylogenetic tree was then constructed under this model using PhyML 3.0 (http://www.atgc-montpellier.fr//phyml/) 51 , and the robustness of the topology was estimated using 1000 bootstrap replicates.

Conclusion
In this study, the cp genomes of four Ligustrum species were assembled and annotated, and a series of comparative analyses were performed together with six additional published Ligustrum species. The results showed that the cp genomes of Ligustrum species have a quadripartite structure, with similar and conserved genome structures and gene numbers. The total length of the cp genome was 162,185-166,800 bp, and the GC content ranged from 37.93 to 38.06%. Six hotspot regions were identified from multiple sequence alignment of Ligustrum; the ycf1 gene region and the clpP1 exon region can be used as potential DNA barcodes for the identification and phylogeny of the genus Ligustrum. The identification of four positively selected genes in this study will contribute to our understanding of the adaptation of Ligustrum species to the environment. Based on the whole cp genomes, we constructed a phylogenetic tree of Oleaceae species, which showed that the 13 genera in Oleaceae clustered into one branch, with high support at each node, and that Ligustrum and Syringa were the most closely related groups. Through sequencing and analysis of the cp genomes of Ligustrum species, this study provides a basis for identifying Ligustrum species and elucidating their phylogenetic relationships.
Specimen collection. The plant material was collected with the owner's permission and in accordance with relevant guidelines and regulations.

Data availability
The original contributions provided in the study are publicly available and can be found at NCBI (SRR21590286, SRR21590285, SRR21590284, SRR21590283). www.nature.com/scientificreports/